The Granularity of Soft-Error Containment in Shared Memory Multiprocessors
Authors
Abstract
Enables flexibility in when to detect

Case Study: HP NSAA
HP’s NonStop Advanced Architecture (NSAA), although not a shared-memory multiprocessor, uses the memory containment granularity. Before performing disk or network I/O, NSAA compares redundant executions. Recovery is accomplished by reverting to a software-created backup process.

Recovery Across I/O
When coordinating checkpoints across all processors, a major challenge becomes the apparent need to recover across I/O. Nakano et al. observe that some common I/O operations are idempotent, which permits coordinated checkpoints for I/O intensive workloads (e.g., TPC-C or HTTP servers).

Case Study: TRUSS
The TRUSS server architecture provides a logical separation layer (Membrane) that enables processor cores and caches to locally checkpoint and recover, independent of other processors or memory. Each processor core and cache form a containment boundary such that all shared-memory interactions are guaranteed error-free. With hardware support for checkpointing and error detection, the TRUSS architecture provides software-transparent reliability for soft errors and a broad class of permanent faults.
Other Examples: HP NonStop, Stratus, etc.

Case Study: IBM z-series
The IBM z-series mainframe uses a custom processor core that consists of two lockstepped, replicated pipelines. Before retiring an instruction, the results from both executions are compared, and if an error is detected, the instruction is re-executed.

Case Study: Redundant Multithreading
Recent work suggests that time-delayed redundant execution using SMT or CMP architectures can improve resource efficiency over lockstepped DMR. In these proposals, error detection is enforced either before the register file or before the cache. Both satisfy the core containment granularity.

Soft-error rates (SER) increasing exponentially
What are the key sources?
Radiation: SER scales with transistor count
Variability: SER increases with level of integration
– Manufacturing: device variations within/across dies
– Lifetime: transistor performance varies over time
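
The compare-before-retire idea described for lockstepped pipelines can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the IBM design or any hardware implementation: the names `make_faulty` and `compare_and_retry`, the single-bit-flip fault model, and the retry limit are all hypothetical and chosen for illustration.

```python
def make_faulty(op, faulty_calls, bit=3):
    """Wrap op so that calls whose index is in faulty_calls return a
    result with one bit flipped (a stand-in for a soft error).
    Hypothetical helper, for illustration only."""
    count = 0

    def wrapped(x):
        nonlocal count
        result = op(x)
        if count in faulty_calls:
            result ^= 1 << bit  # inject a single-bit upset
        count += 1
        return result

    return wrapped


def compare_and_retry(primary, replica, x, max_retries=3):
    """Run both copies, compare their results before 'retiring' the
    value, and re-execute on a mismatch. Returns (value, retries)."""
    for attempt in range(max_retries + 1):
        a = primary(x)
        b = replica(x)
        if a == b:
            return a, attempt  # results agree: safe to commit
    raise RuntimeError("mismatch persisted; escalate to a coarser recovery")


def square(v):
    return v * v


# The primary copy suffers a bit flip on its first execution only.
primary = make_faulty(square, faulty_calls={0})
replica = make_faulty(square, faulty_calls=set())
value, retries = compare_and_retry(primary, replica, 7)
print(value, retries)  # 49 1
```

In this model the error never leaves the comparison point, which is the sense in which detection at retirement contains the error within the core. Note that the comparison only catches faults that make the two copies disagree; identical corruption in both copies would pass unnoticed.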
Similar resources
Modeling and Performance Evaluation of Multi-Processors Organization with Shared Memories
This paper is primarily concerned with the theoretical evaluation of the performance of multiprocessor systems. A Markovian waiting-line model has been developed for various multiprocessor configurations with shared memory. The system is analysed at the request level rather than the job level.
Workload Characterization and Locality Management for Coarse-grain Multiprocessors
Scalable shared memory multiprocessors commonly employ replication and the associated coherency maintenance of memory blocks, but differ in the granularity from fine-grain (cache-coherent multiprocessors) to coarse-grain (page-based distributed shared memory systems). Regardless of the size of coherency blocks, attaining good performance may depend on the number of copies staying small. Previous ...
A Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Current shared-memory multiprocessors suffer from an inherent fragility, since a single hardware or system software failure can cause the entire machine to crash. This dissertation describes a combination of hardware and software techniques that can be used to provide fault containment for large-scale shared memory machines. With fault containment, the impact of a fault remains limited to only ...
Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory
Symmetric multiprocessors (SMPs) connected with low-latency networks provide attractive building blocks for software distributed shared memory systems. Two distinct approaches have been used: the fine-grain approach that instruments application loads and stores to support a small coherence granularity, and the coarse-grain approach based on virtual memory hardware that provides coherence at a p...
Virtual Clusters: Resource Management on Large Shared-memory Multiprocessors. A Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Despite the fact that large scale shared-memory multiprocessors have been commercially available for several years, system software that fully utilizes all of their features is still not available. These machines require system software that is scalable, supports fault containment, and provides scalable resource management. Software supporting these features is currently unavailable, mostly due...